List of used packages: knitr dplyr ggplot2 broom reshape2 janitor plm pwt9 quarto renv shiny targets testthat tidyverse tibble usethis rio lubridate purrr Hmisc plotly hrbrthemes xts seasonal tsbox forecast tseries plotly ggridges shades urca
colorize <- function(x, color) {
if (knitr::is_latex_output()) {
sprintf("\\textcolor{%s}{%s}", color, x)
} else if (knitr::is_html_output()) {
sprintf("<span style='color: %s;'>%s</span>", color,
x)
} else x
}#webshot::install_phantomjs()The dataset used for this project is Daily
Delhi Climate, which consists of the following columns: 1.
date: Date of format YYYY-MM-DD starting from “2013-01-01”
and ending in “2017-01-01”.
2. meantemp: Mean
temperature averaged out from multiple 3 hour intervals in a day.
3. humidity: Humidity value for the day (units are grams of
water vapor per cubic meter volume of air).
4.
wind_speed: Wind speed measured in kmph.
5.
mean_pressure: Pressure reading of weather (measure in
atm)
Q1. How can I find out if analyzing meantemp indvidiually (univariate) suffices, or if including other columns and perform a multivariate analysis would worth the effort to find more meaninful patterns and forecasts?
The goal of this project is to analyze and forecast the mean temperature Delhi, which is recorded in the meantemp column. For this, after importing the dataset, outliers are removed in Preprocessing Section section. Then meantemp column is assigned to a time series object in Construct Time Series section for further processing, analysis, and forecast. After detecting seasonalities using plots, the time series is seasonally adjusted using X13-ARIMA-SEATS. Then, remaining trend is removed in detrend.
Before forecasting the time series, I check for stationarity of time series, as stationarity is an assumption in ARIMA model. For this purpose, unit root tests are applied in Stationarity section.
Finally, I used ARIMA model to forecast the time series in Forecast Time Series section.
df_train <- read_csv("data/DailyDelhiClimateTrain.csv")## Rows: 1462 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): meantemp, humidity, wind_speed, meanpressure
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_test <- read_csv("data/DailyDelhiClimateTest.csv")## Rows: 114 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): meantemp, humidity, wind_speed, meanpressure
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(df_train)## date meantemp humidity wind_speed
## Min. :2013-01-01 Min. : 6.00 Min. : 13.43 Min. : 0.000
## 1st Qu.:2014-01-01 1st Qu.:18.86 1st Qu.: 50.38 1st Qu.: 3.475
## Median :2015-01-01 Median :27.71 Median : 62.62 Median : 6.222
## Mean :2015-01-01 Mean :25.50 Mean : 60.77 Mean : 6.802
## 3rd Qu.:2016-01-01 3rd Qu.:31.31 3rd Qu.: 72.22 3rd Qu.: 9.238
## Max. :2017-01-01 Max. :38.71 Max. :100.00 Max. :42.220
## meanpressure
## Min. : -3.042
## 1st Qu.:1001.580
## Median :1008.563
## Mean :1011.105
## 3rd Qu.:1014.945
## Max. :7679.333
df_train |> describe()## df_train
##
## 5 Variables 1462 Observations
## --------------------------------------------------------------------------------
## date
## n missing distinct Info Mean Gmd .05
## 1462 0 1462 1 2015-01-01 487.7 2013-03-15
## .10 .25 .50 .75 .90 .95
## 2013-05-27 2014-01-01 2015-01-01 2016-01-01 2016-08-07 2016-10-19
##
## lowest : 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05
## highest: 2016-12-28 2016-12-29 2016-12-30 2016-12-31 2017-01-01
## --------------------------------------------------------------------------------
## meantemp
## n missing distinct Info Mean Gmd .05 .10
## 1462 0 617 1 25.5 8.331 12.51 14.62
## .25 .50 .75 .90 .95
## 18.86 27.71 31.31 33.71 35.43
##
## lowest : 6.000000 7.000000 7.166667 7.400000 8.666667
## highest: 38.200000 38.272727 38.428571 38.500000 38.714286
## --------------------------------------------------------------------------------
## humidity
## n missing distinct Info Mean Gmd .05 .10
## 1462 0 897 1 60.77 18.96 29.00 36.21
## .25 .50 .75 .90 .95
## 50.38 62.62 72.22 81.85 86.99
##
## lowest : 13.42857 15.85714 18.46667 19.00000 19.50000
## highest: 96.12500 96.62500 96.85714 98.00000 100.00000
## --------------------------------------------------------------------------------
## wind_speed
## n missing distinct Info Mean Gmd .05 .10
## 1462 0 736 1 6.802 4.878 0.925 1.625
## .25 .50 .75 .90 .95
## 3.475 6.222 9.238 12.608 14.813
##
## lowest : 0.0000000 0.4625000 0.4750000 0.5285714 0.6166667
## highest: 27.7750000 30.6857143 33.3250000 34.4875000 42.2200000
## --------------------------------------------------------------------------------
## meanpressure
## n missing distinct Info Mean Gmd .05 .10
## 1462 0 626 1 1011 22.92 996.8 998.1
## .25 .50 .75 .90 .95
## 1001.6 1008.6 1014.9 1017.8 1019.3
##
## lowest : -3.041667 12.045455 310.437500 633.900000 938.066667
## highest: 1022.125000 1023.000000 1350.296296 1352.615385 7679.333333
##
## Value 0 300 600 900 1000 1400 7700
## Frequency 2 1 1 2 1453 2 1
## Proportion 0.001 0.001 0.001 0.001 0.994 0.001 0.001
##
## For the frequency table, variable is rounded to the nearest 100
## --------------------------------------------------------------------------------
Below we can see interactive plot of the time series.
p <- df_train |>
ggplot( aes(x=date, y=meantemp)) +
geom_area(fill="#69b3a2", alpha=0.5) +
geom_line(color="#69b3a2") +
ylab("bitcoin price ($)") +
theme_ipsum()
# Turn it interactive with ggplotly
p <- ggplotly(p)
#p
pWe can detect an outlier at the last observation (last row of dataframe). It causes an abrupt decrease in value of temperature. This would lead to problems in further analysis I will proceed and also when I later apply functions on the time series. Therefore, I replace the last observation’s value with its previous one.
Q2. Is imputing with the last observation the right approach here? Or is preferrable that I use median of last n (rows) for instance? Maybe it won’t matter, as when I aggregate data (per week, month, week, etc.), I remove the last observation.
previous_value <- df_train$meantemp[df_train$date == as.Date('2016-12-31')]
df_train$meantemp[df_train$date == as.Date('2017-01-01')]<- previous_value #df_train <- head(df_train, -1)
head(df_train)## # A tibble: 6 × 5
## date meantemp humidity wind_speed meanpressure
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2013-01-01 10 84.5 0 1016.
## 2 2013-01-02 7.4 92 2.98 1018.
## 3 2013-01-03 7.17 87 4.63 1019.
## 4 2013-01-04 8.67 71.3 1.23 1017.
## 5 2013-01-05 6 86.8 3.7 1016.
## 6 2013-01-06 7 82.8 1.48 1018
tail(df_train)## # A tibble: 6 × 5
## date meantemp humidity wind_speed meanpressure
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2016-12-27 16.8 67.6 8.34 1017.
## 2 2016-12-28 17.2 68.0 3.55 1016.
## 3 2016-12-29 15.2 87.9 6 1017.
## 4 2016-12-30 14.1 89.7 6.27 1018.
## 5 2016-12-31 15.1 87 7.32 1016.
## 6 2017-01-01 15.1 100 0 1016
Let us how the plot looks after removing the outlier:
p <- df_train |>
ggplot( aes(x=date, y=meantemp)) +
geom_area(fill="#69b3a2", alpha=0.5) +
geom_line(color="#69b3a2") +
ylab("bitcoin price ($)") +
theme_ipsum()
# Turn it interactive with ggplotly
p <- ggplotly(p)
#p
pFind if there is any missing dates
date_range <- seq(min(df_train$date), max(df_train$date), by = 1)
date_range[!date_range %in% df_train$date] ## Date of length 0
summary(df_test)## date meantemp humidity wind_speed
## Min. :2017-01-01 Min. :11.00 Min. :17.75 Min. : 1.387
## 1st Qu.:2017-01-29 1st Qu.:16.44 1st Qu.:39.62 1st Qu.: 5.564
## Median :2017-02-26 Median :19.88 Median :57.75 Median : 8.069
## Mean :2017-02-26 Mean :21.71 Mean :56.26 Mean : 8.144
## 3rd Qu.:2017-03-26 3rd Qu.:27.71 3rd Qu.:71.90 3rd Qu.:10.069
## Max. :2017-04-24 Max. :34.50 Max. :95.83 Max. :19.314
## meanpressure
## Min. : 59
## 1st Qu.:1007
## Median :1013
## Mean :1004
## 3rd Qu.:1017
## Max. :1023
df_test |> describe()## df_test
##
## 5 Variables 114 Observations
## --------------------------------------------------------------------------------
## date
## n missing distinct Info Mean Gmd .05
## 114 0 114 1 2017-02-26 38.33 2017-01-06
## .10 .25 .50 .75 .90 .95
## 2017-01-12 2017-01-29 2017-02-26 2017-03-26 2017-04-12 2017-04-18
##
## lowest : 2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05
## highest: 2017-04-20 2017-04-21 2017-04-22 2017-04-23 2017-04-24
## --------------------------------------------------------------------------------
## meantemp
## n missing distinct Info Mean Gmd .05 .10
## 114 0 105 1 21.71 7.226 13.22 14.75
## .25 .50 .75 .90 .95
## 16.44 19.88 27.71 31.00 32.67
##
## lowest : 11.00000 11.72222 11.78947 12.11111 13.04167
## highest: 32.90000 33.50000 34.00000 34.25000 34.50000
## --------------------------------------------------------------------------------
## humidity
## n missing distinct Info Mean Gmd .05 .10
## 114 0 109 1 56.26 21.97 26.74 29.49
## .25 .50 .75 .90 .95
## 39.62 57.75 71.90 78.42 82.20
##
## lowest : 17.75000 19.42857 21.12500 24.12500 26.00000
## highest: 83.52632 84.44444 85.86957 91.64286 95.83333
## --------------------------------------------------------------------------------
## wind_speed
## n missing distinct Info Mean Gmd .05 .10
## 114 0 109 1 8.144 4.031 2.842 3.715
## .25 .50 .75 .90 .95
## 5.564 8.069 10.069 13.464 14.353
##
## lowest : 1.387500 1.625000 1.854545 1.950000 2.100000
## highest: 14.384615 15.512500 16.662500 17.590000 19.314286
## --------------------------------------------------------------------------------
## meanpressure
## n missing distinct Info Mean Gmd .05 .10
## 114 0 109 1 1004 23.16 1002 1004
## .25 .50 .75 .90 .95
## 1007 1013 1017 1019 1021
##
## lowest : 59.000 998.625 999.875 1000.875 1001.600
## highest: 1021.375 1021.556 1021.789 1021.958 1022.810
##
## Value 60 1000 1010 1020
## Frequency 1 14 53 46
## Proportion 0.009 0.123 0.465 0.404
##
## For the frequency table, variable is rounded to the nearest 10
## --------------------------------------------------------------------------------
xts_test_meantemp <- xts(df_test$meantemp, order.by=df_test$date, "%Y-%m-%d")head(xts_test_meantemp)## [,1]
## 2017-01-01 15.91304
## 2017-01-02 18.50000
## 2017-01-03 17.11111
## 2017-01-04 18.70000
## 2017-01-05 18.38889
## 2017-01-06 19.31818
tail(xts_test_meantemp)## [,1]
## 2017-04-19 33.500
## 2017-04-20 34.500
## 2017-04-21 34.250
## 2017-04-22 32.900
## 2017-04-23 32.875
## 2017-04-24 32.000
ts_plot(xts_test_meantemp)ts_test_meantemp <- ts_ts(xts_test_meantemp)
xts_week_test_meantemp <- apply.weekly(xts_test_meantemp,sum)
ts_week_test_meantemp <- ts_ts(xts_week_test_meantemp)ts_plot(xts_week_test_meantemp)Now we use the meantemp column to create our time series
data. I assigned the time series to xts objects. But since many
functions later require ts object, each time I define an xts, I also
convert it to ts object using tsbox::ts_ts
min(df_train$date)## [1] "2013-01-01"
max(df_train$date)## [1] "2017-01-01"
#ts_train <- zoo(df_train$meantemp, df_train$date)
xts_train_meantemp <- xts(df_train$meantemp, order.by=df_train$date, "%Y-%m-%d")
class(xts_train_meantemp)## [1] "xts" "zoo"
head(xts_train_meantemp)## [,1]
## 2013-01-01 10.000000
## 2013-01-02 7.400000
## 2013-01-03 7.166667
## 2013-01-04 8.666667
## 2013-01-05 6.000000
## 2013-01-06 7.000000
tail(xts_train_meantemp)## [,1]
## 2016-12-27 16.85000
## 2016-12-28 17.21739
## 2016-12-29 15.23810
## 2016-12-30 14.09524
## 2016-12-31 15.05263
## 2017-01-01 15.05263
# convert xts to ts
## Create a daily Date object for ts
#inds <- seq(as.Date("2013-01-01"), as.Date("2017-01-01"), by = "day")
#set.seed(25)
#ts_train <- ts(df_train$meantemp, # random data
# start = c(2013, as.numeric(format(inds[1], "%j"))),
# frequency = 365)
#ts_train <- ts(df_train$meantemp, start = decimal_date(ymd("2013-01-01")), frequency = 365.25 / 7)Convert XTS objects to TS objects:
ts_train_meantemp <-ts_ts(xts_train_meantemp)
head(ts_train_meantemp)## Time Series:
## Start = 2013
## End = 2013.01368953503
## Frequency = 365.2425
## [1] 10.000000 7.400000 7.166667 8.666667 6.000000 7.000000
tail(ts_train_meantemp)## Time Series:
## Start = 2016.98639260218
## End = 2017.00008213721
## Frequency = 365.2425
## [1] 16.85000 17.21739 15.23810 14.09524 15.05263 15.05263
I plot a static plot as well:
ts_plot(xts_train_meantemp)
## Seasonality {#seas}
From the initial plot I judge that there is yearly seasonality. For more delicate observation to find if there is more granular periods of seasonality, I use seasonality plots. Before that, I aggregate data weekly, monthly, and quarterly.
# Weekly mean temperature
xts_week_train_meantemp <- apply.weekly(xts_train_meantemp,sum)
ts_week_train_meantemp <-ts_ts(xts_week_train_meantemp)
# Monthly mean temperature
xts_mon_train_meantemp <- aggregate(xts_train_meantemp, by=as.yearmon, FUN=sum)
ts_mon_train_meantemp <-ts_ts(xts_mon_train_meantemp)
# Quarterly mean temperature
xts_quar_train_meantemp <- aggregate(xts_train_meantemp, as.yearqtr, FUN=sum)
ts_quar_train_meantemp <-ts_ts(xts_quar_train_meantemp)
# Yearly mean temperate
as.year <- function(x) as.integer(as.yearmon(x))
xts_year_train_meantemp <- aggregate(xts_train_meantemp, by=as.year, FUN=sum)
#ts_year_train_meantemp <-ts_ts(xts_year_train_meantemp)
#xts_year_train_meantemp[1]The year 2017 has only one observation, so I remove it from all the
aggregated datasets. I couldn’t do it before aggregating, otherwise I
would have confronted the error
Error: series has no regular pattern.
xts_week_train_meantemp <- head(xts_week_train_meantemp, -1)
xts_mon_train_meantemp <- head(xts_mon_train_meantemp, -1)
xts_quar_train_meantemp <- head(xts_quar_train_meantemp, -1)
ts_week_train_meantemp <- head(ts_week_train_meantemp, -1)
ts_mon_train_meantemp <- head(ts_mon_train_meantemp, -1)
ts_quar_train_meantemp <- head(ts_quar_train_meantemp, -1)#options(repr.plot.width = 7, repr.plot.height =20)
forecast::ggseasonplot(ts_mon_train_meantemp, year.labels=TRUE, year.labels.left=TRUE, labelgap = 0.1) +
ylab("degree") +
ggtitle("Seasonal plot: Monthly Mean Temperature")forecast::ggseasonplot(ts_mon_train_meantemp, year.labels=TRUE, year.labels.left=TRUE, labelgap = 0.1, polar=TRUE) +
ylab("degree") +
ggtitle("Polar Seasonal plot: Monthly Mean Temperature")#options(repr.plot.width = 7, repr.plot.height =20)
forecast::ggseasonplot(ts_quar_train_meantemp, year.labels=TRUE, year.labels.left=TRUE, labelgap = 0.1) +
ylab("degree") +
ggtitle("Seasonal plot: Quarterly Mean Temperature")forecast::ggseasonplot(ts_quar_train_meantemp, year.labels=TRUE, year.labels.left=TRUE, labelgap = 0.1, polar=TRUE) +
ylab("degree") +
ggtitle("Polar Seasonal plot: Quarterly Mean Temperature")
### Deseasonalize {#deseas}
If I need to remove different periods of seasonality together, I
would need to use the forecast:msts function. For instance
in below I remove weekly and yearly seasonality together.
des_ts_train_meantemp <- msts(xts_train_meantemp,seasonal.periods = c(7,365))
#head(des_xts_train)
#library(tsbox)
#ts_train <-ts_ts(xts_train)
#ts_train
class(des_ts_train_meantemp)## [1] "msts" "ts"
However, since its output had an unfamiliar and weird shape to me,
and also since I wasn’t sure it uses the state-of-the-art X13
decomposition, I incorporated the X-13ARIMA-SEATS
using seasonal:seas function. However, it has some
limitations, as stated in the package’s reference
manua. For instance, the number of observations must not exceed 780.
Nor should maximum seasonal period exceed 12. That is why I couldn’t use
original data ts_train and also the weekly aggregated data
ts_week_train, as I would confront the error
Seasonal period too large. The only possible aggregated
data with highest frequency possible was monthly aggregated,
ts_mon_train. However, I am concerned that I would lose
significant pattern and information with this amount of aggregation.
Q3. If you could kindly share your viewpoint here, it would be very helpful for me to ensure how to proceed.
length(xts_train_meantemp)## [1] 1462
length(ts_train_meantemp)## [1] 1462
length(xts_train_meantemp)## [1] 1462
nowXTS <-ts_xts(ts_train_meantemp)
length(nowXTS)## [1] 1462
length(ts_week_train_meantemp)## [1] 208
plot(ts_week_train_meantemp)length(ts_week_train_meantemp)## [1] 208
plot(ts_train_meantemp)length(ts_train_meantemp)## [1] 1462
plot(ts_mon_train_meantemp)length(ts_mon_train_meantemp)## [1] 48
m <- seas(ts_mon_train_meantemp)
ts_train_adj_meantemp <- final(m)
#ts_train_adj
length(ts_train_adj_meantemp)## [1] 48
m <- seas(ts_mon_train_meantemp)
ts_train_adj_meantemp <- final(m)
#ts_train_adj
length(ts_train_adj_meantemp)## [1] 48
plot(ts_train_adj_meantemp)Plot original data along with trend and seasonally adjusted data
#ts_train
#series(m, "forecast.forecasts")
#out(m)
#seasadj(m)
autoplot(ts_mon_train_meantemp, series="Original Data") +
autolayer(trendcycle(m), series="Trend") +
autolayer(seasadj(m), series="Seasonally Adjusted") +
xlab("Year") + ylab("Mean Temperature") +
ggtitle("Mean Temperature Decomposed using X13") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Original Data","Seasonally Adjusted","Trend"))#ap < ggplotly(ap)In the seasonally adjusted time series ts_train_adj, I
detected a trend, therefore I detrend it using differencing.
#ts_train_adj_meantemp |> log() |> nsdiffs(alpha=0.01) -> ts_train_adj_det_meantemp
ts_train_adj_meantemp |> log() |> diff() -> ts_train_adj_det_meantempplot(ts_train_adj_det_meantemp)#plot(d)ggAcf(ts_week_train_meantemp, lag=50)pacf (ts_week_train_meantemp, lag=50, pl = TRUE)ggAcf(ts_train_adj_meantemp, lag=10)pacf (ts_train_adj_meantemp, lag=10, pl = TRUE)ggAcf(ts_train_adj_det_meantemp, lag=10)pacf (ts_train_adj_det_meantemp, lag=10, pl = TRUE)ts_train_meantemp |> adf.test()##
## Augmented Dickey-Fuller Test
##
## data: ts_train_meantemp
## Dickey-Fuller = -1.9871, Lag order = 11, p-value = 0.5838
## alternative hypothesis: stationary
ts_week_train_meantemp |> adf.test()##
## Augmented Dickey-Fuller Test
##
## data: ts_week_train_meantemp
## Dickey-Fuller = -3.8729, Lag order = 5, p-value = 0.01651
## alternative hypothesis: stationary
ts_train_adj_meantemp |> adf.test() ##
## Augmented Dickey-Fuller Test
##
## data: ts_train_adj_meantemp
## Dickey-Fuller = -2.5897, Lag order = 3, p-value = 0.3386
## alternative hypothesis: stationary
ts_train_adj_det_meantemp |> adf.test() ## Warning in adf.test(ts_train_adj_det_meantemp): p-value smaller than printed p-
## value
##
## Augmented Dickey-Fuller Test
##
## data: ts_train_adj_det_meantemp
## Dickey-Fuller = -4.698, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
ts_train_meantemp |> ur.kpss() |> summary()##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.5812
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ts_week_train_meantemp |> ur.kpss() |> summary()##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.1618
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ts_train_adj_meantemp |> ur.kpss() |> summary()##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 3 lags.
##
## Value of test-statistic is: 0.9974
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ts_train_adj_det_meantemp |> ur.kpss() |> summary()##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 3 lags.
##
## Value of test-statistic is: 0.0595
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ts_train_meantemp |> ur.df() |> summary()##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6532 -0.8403 0.1233 1.0729 6.5542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -0.001560 0.001622 -0.962 0.336
## z.diff.lag -0.158894 0.025836 -6.150 9.98e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.644 on 1458 degrees of freedom
## Multiple R-squared: 0.02616, Adjusted R-squared: 0.02483
## F-statistic: 19.59 on 2 and 1458 DF, p-value: 4.036e-09
##
##
## Value of test-statistic is: -0.9621
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
seasonal=TRUE, as in cases 3m I use data that I
seasonally adjusted them already. Setting seasonal=TRUE
makes the model more time-consuming.#forecast_ts_train_meantemp = auto.arima(ts_train_meantemp,
# trace = TRUE,
# seasonal=TRUE,
# stepwise=FALSE,
# approximation=FALSE)
#checkresiduals(forecast_ts_train_meantemp)forecast_ts_train_meantemp <- auto.arima(ts_train_meantemp,
d = 1,
D = 1,
start.p = 2,
start.q = 3,
max.p = 2,
#max.d = 1,
max.q = 3,
start.P = 0,
start.Q = 0,
max.P = 0,
#max.D = 1,
max.Q = 0,
trace = TRUE,
seasonal=TRUE,
stepwise=TRUE,
approximation=FALSE
#nmodels=
)##
## ARIMA(2,1,3)(0,1,0)[365] : 4789.495
## ARIMA(0,1,0)(0,1,0)[365] : 4943.724
## ARIMA(1,1,0)(0,1,0)[365] : 4926.887
## ARIMA(0,1,1)(0,1,0)[365] : 4920.711
## ARIMA(1,1,3)(0,1,0)[365] : 4789.51
## ARIMA(2,1,2)(0,1,0)[365] : Inf
## ARIMA(1,1,2)(0,1,0)[365] : 4790.218
##
## Best model: ARIMA(2,1,3)(0,1,0)[365]
checkresiduals(forecast_ts_train_meantemp)##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,3)(0,1,0)[365]
## Q* = 363.76, df = 287, p-value = 0.001427
##
## Model df: 5. Total lags used: 292
#checkresiduals(forecast_ts_train_meantemp)
forecast_ts_train_meantemp## Series: ts_train_meantemp
## ARIMA(2,1,3)(0,1,0)[365]
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3
## 0.0104 0.4362 -0.2760 -0.6134 -0.0888
## s.e. 0.3402 0.2577 0.3395 0.3585 0.0480
##
## sigma^2 = 4.561: log likelihood = -2388.71
## AIC=4789.42 AICc=4789.5 BIC=4819.41
autoplot(forecast(forecast_ts_train_meantemp))# + autolayer(xts_test_meantemp)length(ts_test_meantemp)## [1] 114
forecast_ts_train_meantemp |> forecast(h=114) |>
autoplot() + autolayer(ts_test_meantemp)seasonal=TRUE, as in case 3, I
use data that I seasonally adjusted them already. Setting
seasonal=TRUE makes the model more time-consuming.#forecast_ts_week_train_meantemp = auto.arima(ts_week_train_meantemp,
# trace = TRUE,
# seasonal=TRUE,
# stepwise=FALSE,
# approximation=FALSE)
#checkresiduals(forecast_ts_week_train_meantemp)#autoplot(forecast(ts_week_train_meantemp)) #+ autolayer(ts_week_test_meantemp)forecast_ts_train_adj_meantemp = auto.arima(ts_train_adj_meantemp,
trace = TRUE,
seasonal= FALSE,
stepwise=FALSE,
approximation=FALSE)##
## ARIMA(0,1,0) : 441.3883
## ARIMA(0,1,0) with drift : 443.1154
## ARIMA(0,1,1) : 429.9577
## ARIMA(0,1,1) with drift : 429.5371
## ARIMA(0,1,2) : 432.2379
## ARIMA(0,1,2) with drift : 431.8095
## ARIMA(0,1,3) : 434.4112
## ARIMA(0,1,3) with drift : 433.7996
## ARIMA(0,1,4) : 434.6353
## ARIMA(0,1,4) with drift : Inf
## ARIMA(0,1,5) : Inf
## ARIMA(0,1,5) with drift : Inf
## ARIMA(1,1,0) : 432.8756
## ARIMA(1,1,0) with drift : 434.1589
## ARIMA(1,1,1) : 432.2366
## ARIMA(1,1,1) with drift : 431.7635
## ARIMA(1,1,2) : 434.6252
## ARIMA(1,1,2) with drift : Inf
## ARIMA(1,1,3) : 436.6632
## ARIMA(1,1,3) with drift : Inf
## ARIMA(1,1,4) : 436.8366
## ARIMA(1,1,4) with drift : Inf
## ARIMA(2,1,0) : 432.5313
## ARIMA(2,1,0) with drift : 433.449
## ARIMA(2,1,1) : 434.5063
## ARIMA(2,1,1) with drift : 433.7124
## ARIMA(2,1,2) : 436.9823
## ARIMA(2,1,2) with drift : Inf
## ARIMA(2,1,3) : Inf
## ARIMA(2,1,3) with drift : Inf
## ARIMA(3,1,0) : 434.8565
## ARIMA(3,1,0) with drift : 435.9464
## ARIMA(3,1,1) : 436.788
## ARIMA(3,1,1) with drift : Inf
## ARIMA(3,1,2) : Inf
## ARIMA(3,1,2) with drift : Inf
## ARIMA(4,1,0) : 436.5007
## ARIMA(4,1,0) with drift : 437.448
## ARIMA(4,1,1) : Inf
## ARIMA(4,1,1) with drift : 436.8826
## ARIMA(5,1,0) : 436.434
## ARIMA(5,1,0) with drift : 436.606
##
##
##
## Best model: ARIMA(0,1,1) with drift
checkresiduals(forecast_ts_train_adj_meantemp)##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 7.4191, df = 9, p-value = 0.5936
##
## Model df: 1. Total lags used: 10
#forecast_ts_train_adj + ts_train_adj
autoplot(forecast(forecast_ts_train_adj_meantemp))forecast_ts_train_adj_det_meantemp = auto.arima(ts_train_adj_det_meantemp,
trace = TRUE,
seasonal= FALSE,
stepwise=FALSE,
approximation=FALSE)##
## ARIMA(0,0,0) with zero mean : -183.1965
## ARIMA(0,0,0) with non-zero mean : -181.4467
## ARIMA(0,0,1) with zero mean : -195.2457
## ARIMA(0,0,1) with non-zero mean : -195.724
## ARIMA(0,0,2) with zero mean : -192.9693
## ARIMA(0,0,2) with non-zero mean : -193.439
## ARIMA(0,0,3) with zero mean : -190.7534
## ARIMA(0,0,3) with non-zero mean : -191.4021
## ARIMA(0,0,4) with zero mean : -190.4085
## ARIMA(0,0,4) with non-zero mean : Inf
## ARIMA(0,0,5) with zero mean : Inf
## ARIMA(0,0,5) with non-zero mean : Inf
## ARIMA(1,0,0) with zero mean : -191.9675
## ARIMA(1,0,0) with non-zero mean : -190.6383
## ARIMA(1,0,1) with zero mean : -192.971
## ARIMA(1,0,1) with non-zero mean : -193.4768
## ARIMA(1,0,2) with zero mean : -190.5819
## ARIMA(1,0,2) with non-zero mean : Inf
## ARIMA(1,0,3) with zero mean : -188.4672
## ARIMA(1,0,3) with non-zero mean : Inf
## ARIMA(1,0,4) with zero mean : -188.2137
## ARIMA(1,0,4) with non-zero mean : Inf
## ARIMA(2,0,0) with zero mean : -192.6467
## ARIMA(2,0,0) with non-zero mean : -191.6913
## ARIMA(2,0,1) with zero mean : -190.668
## ARIMA(2,0,1) with non-zero mean : -191.4938
## ARIMA(2,0,2) with zero mean : -188.1949
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(2,0,3) with zero mean : Inf
## ARIMA(2,0,3) with non-zero mean : Inf
## ARIMA(3,0,0) with zero mean : -190.3099
## ARIMA(3,0,0) with non-zero mean : -189.1897
## ARIMA(3,0,1) with zero mean : -188.4346
## ARIMA(3,0,1) with non-zero mean : Inf
## ARIMA(3,0,2) with zero mean : Inf
## ARIMA(3,0,2) with non-zero mean : Inf
## ARIMA(4,0,0) with zero mean : -188.6455
## ARIMA(4,0,0) with non-zero mean : -187.6592
## ARIMA(4,0,1) with zero mean : Inf
## ARIMA(4,0,1) with non-zero mean : -188.2978
## ARIMA(5,0,0) with zero mean : -188.6625
## ARIMA(5,0,0) with non-zero mean : -188.4274
##
##
##
## Best model: ARIMA(0,0,1) with non-zero mean
checkresiduals(forecast_ts_train_adj_det_meantemp)##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 7.1212, df = 8, p-value = 0.5236
##
## Model df: 1. Total lags used: 9
autoplot(forecast(forecast_ts_train_adj_det_meantemp))Q4. I did all unit root tests and ARIMA models on all the following datasets: original time series, deseasonalized time series, and combination of deaseonalized and detrended time series. Judging by the plots, the adjusted versions performed poorly when fed to the model compared to feeding the weekly aggregated data that is fed to AUTOARIMA but with autmatic seasonality modelling of seasonality. Could you please share your viewpoint regarding this?
Now we model a multivariate time series by using both the columns
meantemp and wind_speed.
We plot wind_speed time series first:
We use an interactive plot.
p2 <- df_train |>
ggplot( aes(x=date, y=wind_speed)) +
geom_area(fill="#69b3a2", alpha=0.5) +
geom_line(color="#69b3a2") +
ylab("bitcoin price ($)") +
theme_ipsum()
# Turn it interactive with ggplotly
p2 <- ggplotly(p2)
#p
p2Now we use a static plot.
#xts_train_meantemp <- xts(df_train$meantemp, order.by=df_train$date, "%Y-%m-%d")
#ts_train_meantemp <-ts_ts(xts_train_meantemp)
xts_train_windspeed <- xts(df_train$wind_speed, order.by=df_train$date, "%Y-%m-%d")
ts_train_windspeed <-ts_ts(xts_train_windspeed)ts_plot(ts_train_windspeed)We must deal with anomalies first.
xts_train_windspeed <- tsclean(xts_train_windspeed)ts_plot(xts_train_windspeed)In what follows, interactive plot of both time series are illustrated:
fig <- plot_ly(df_train, type = 'scatter', mode = 'lines')%>%
add_trace(x = ~date, y = ~meantemp, name = 'MeanTemp')%>%
add_trace(x = ~date, y = ~wind_speed, name = 'WindSpeed')%>%
layout(title = 'custom tick labels',legend=list(title=list(text='variable')),
xaxis = list(dtick = "M1", tickformat= "%b\n%Y"), width = 2000)## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
options(warn = -1)
fig <- fig %>%
layout(
xaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff', tickangle = 0),
yaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
plot_bgcolor='#e5ecf6')
figWe aggregate the windspeed time series by weekly
# Weekly mean temperature
xts_week_train_windspeed <- apply.weekly(xts_train_windspeed,sum)
ts_week_train_windspeed <- ts_ts(xts_week_train_windspeed)VAR_data <- ts.union(ts_train_meantemp, ts_train_windspeed)
colnames(VAR_data) <- cbind("meantemp","wind_speed")
#VAR_data <- window(ts.union(GDPGrowth, TSpread), start = c(1980, 3), end = c(2012, 4))
v1 <- cbind(xts_week_train_meantemp, xts_week_train_windspeed)
colnames(v1) <- cbind("meantemp","wind_speed")#lagselect <- VARselect(v1, type = "both")
#lagselect$selectionlagselect <- VARselect(VAR_data, type = "both")
lagselect$selection## AIC(n) HQ(n) SC(n) FPE(n)
## 8 6 4 8
Let us plot static plots for them individually as well:
ts_plot(ts_week_train_windspeed)ts_plot(xts_train_windspeed)Now that we have merged the column meantemp with
wind_speed, we use VAR models:
v1 <- na.remove(v1)
tail(v1)## meantemp wind_speed
## [203,] 153.7336 37.39762
## [204,] 159.8950 67.64741
## [205,] 143.0019 33.87750
## [206,] 128.2276 21.53095
## [207,] 125.0980 41.68784
## [208,] 119.8610 67.45303
VAR_est <- VAR(y = VAR_data, season=4, type="both")
VAR_est##
## VAR Estimation Results:
## =======================
##
## Estimated coefficients for equation meantemp:
## =============================================
## Call:
## meantemp = meantemp.l1 + wind_speed.l1 + const + trend + sd1 + sd2 + sd3
##
## meantemp.l1 wind_speed.l1 const trend sd1
## 9.770688e-01 -1.831148e-02 7.308115e-01 -2.433203e-05 9.295644e-02
## sd2 sd3
## 4.342123e-02 1.138176e-01
##
##
## Estimated coefficients for equation wind_speed:
## ===============================================
## Call:
## wind_speed = meantemp.l1 + wind_speed.l1 + const + trend + sd1 + sd2 + sd3
##
## meantemp.l1 wind_speed.l1 const trend sd1
## 0.1238160590 0.3760998074 1.4403595300 -0.0004801852 0.3097544851
## sd2 sd3
## 0.0327325939 0.5358723379
summary(VAR_est)##
## VAR Estimation Results:
## =========================
## Endogenous variables: meantemp, wind_speed
## Deterministic variables: both
## Sample size: 1461
## Log Likelihood: -6904.721
## Roots of the characteristic polynomial:
## 0.9733 0.3799
## Call:
## VAR(y = VAR_data, type = "both", season = 4L)
##
##
## Estimation results for equation meantemp:
## =========================================
## meantemp = meantemp.l1 + wind_speed.l1 + const + trend + sd1 + sd2 + sd3
##
## Estimate Std. Error t value Pr(>|t|)
## meantemp.l1 9.771e-01 6.268e-03 155.893 < 2e-16 ***
## wind_speed.l1 -1.831e-02 1.001e-02 -1.830 0.0675 .
## const 7.308e-01 1.679e-01 4.352 1.44e-05 ***
## trend -2.433e-05 1.039e-04 -0.234 0.8148
## sd1 9.296e-02 1.226e-01 0.758 0.4485
## sd2 4.342e-02 1.225e-01 0.354 0.7230
## sd3 1.138e-01 1.226e-01 0.928 0.3534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 1.656 on 1454 degrees of freedom
## Multiple R-Squared: 0.9492, Adjusted R-squared: 0.949
## F-statistic: 4532 on 6 and 1454 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation wind_speed:
## ===========================================
## wind_speed = meantemp.l1 + wind_speed.l1 + const + trend + sd1 + sd2 + sd3
##
## Estimate Std. Error t value Pr(>|t|)
## meantemp.l1 0.1238161 0.0151937 8.149 7.82e-16 ***
## wind_speed.l1 0.3760998 0.0242597 15.503 < 2e-16 ***
## const 1.4403595 0.4070498 3.539 0.000415 ***
## trend -0.0004802 0.0002518 -1.907 0.056692 .
## sd1 0.3097545 0.2972154 1.042 0.297497
## sd2 0.0327326 0.2969311 0.110 0.912237
## sd3 0.5358723 0.2972506 1.803 0.071632 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 4.013 on 1454 degrees of freedom
## Multiple R-Squared: 0.2284, Adjusted R-squared: 0.2253
## F-statistic: 71.75 on 6 and 1454 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## meantemp wind_speed
## meantemp 2.7409 0.2722
## wind_speed 0.2722 16.1075
##
## Correlation matrix of residuals:
## meantemp wind_speed
## meantemp 1.00000 0.04097
## wind_speed 0.04097 1.00000
summary(VAR_est$varresult)## Length Class Mode
## meantemp 12 lm list
## wind_speed 12 lm list
summary(VAR_est$varresult$meantemp)$adj.r.squared## [1] 0.9490318
summary(VAR_est$varresult$wind_speed)$adj.r.squared## [1] 0.2252582
forecasts <- predict(VAR_est, h=114)
plot(forecasts)#VAR_EQ1 <- dynlm(ts_train_meantemp ~ #L(ts_train_meantemp, 1:2) + L(ts_train_windspeed, 1:2), #
# start = c(2013-01-01), end = c(2012, #4))
#VAR_EQ2 <- dynlm(ts_train_windspeed ~ #L(ts_train_meantemp, 1:2) + L(ts_train_windspeed, 1:2),
# start = #decimal_date(ymd("2013-01-01")), end = start = #decimal_date(ymd("2017-01-01")))Granger_meantemp <- causality(VAR_est, cause = "meantemp")
Granger_meantemp## $Granger
##
## Granger causality H0: meantemp do not Granger-cause wind_speed
##
## data: VAR object VAR_est
## F-Test = 66.409, df1 = 1, df2 = 2908, p-value = 5.551e-16
##
##
## $Instant
##
## H0: No instantaneous causality between: meantemp and wind_speed
##
## data: VAR object VAR_est
## Chi-squared = 2.448, df = 1, p-value = 0.1177
Granger_windspeed <- causality(VAR_est, cause = "wind_speed")
Granger_windspeed## $Granger
##
## Granger causality H0: wind_speed do not Granger-cause meantemp
##
## data: VAR object VAR_est
## F-Test = 3.3482, df1 = 1, df2 = 2908, p-value = 0.06738
##
##
## $Instant
##
## H0: No instantaneous causality between: wind_speed and meantemp
##
## data: VAR object VAR_est
## Chi-squared = 2.448, df = 1, p-value = 0.1177
FEVD1 <- fevd(VAR_est, n.ahead = 50)
FEVD1## $meantemp
## meantemp wind_speed
## [1,] 1.0000000 0.000000000
## [2,] 0.9989928 0.001007226
## [3,] 0.9980521 0.001947932
## [4,] 0.9973726 0.002627385
## [5,] 0.9968980 0.003101993
## [6,] 0.9965606 0.003439371
## [7,] 0.9963133 0.003686674
## [8,] 0.9961262 0.003873790
## [9,] 0.9959805 0.004019484
## [10,] 0.9958643 0.004135748
## [11,] 0.9957695 0.004230468
## [12,] 0.9956910 0.004308988
## [13,] 0.9956250 0.004375034
## [14,] 0.9955687 0.004431279
## [15,] 0.9955203 0.004479687
## [16,] 0.9954783 0.004521731
## [17,] 0.9954415 0.004558537
## [18,] 0.9954090 0.004590982
## [19,] 0.9953802 0.004619757
## [20,] 0.9953546 0.004645417
## [21,] 0.9953316 0.004668409
## [22,] 0.9953109 0.004689100
## [23,] 0.9952922 0.004707793
## [24,] 0.9952753 0.004724739
## [25,] 0.9952598 0.004740152
## [26,] 0.9952458 0.004754211
## [27,] 0.9952329 0.004767068
## [28,] 0.9952211 0.004778854
## [29,] 0.9952103 0.004789684
## [30,] 0.9952003 0.004799654
## [31,] 0.9951911 0.004808850
## [32,] 0.9951827 0.004817347
## [33,] 0.9951748 0.004825211
## [34,] 0.9951675 0.004832499
## [35,] 0.9951607 0.004839263
## [36,] 0.9951545 0.004845549
## [37,] 0.9951486 0.004851397
## [38,] 0.9951432 0.004856843
## [39,] 0.9951381 0.004861921
## [40,] 0.9951333 0.004866660
## [41,] 0.9951289 0.004871087
## [42,] 0.9951248 0.004875224
## [43,] 0.9951209 0.004879096
## [44,] 0.9951173 0.004882720
## [45,] 0.9951139 0.004886116
## [46,] 0.9951107 0.004889299
## [47,] 0.9951077 0.004892285
## [48,] 0.9951049 0.004895088
## [49,] 0.9951023 0.004897720
## [50,] 0.9950998 0.004900192
##
## $wind_speed
## meantemp wind_speed
## [1,] 0.001678349 0.9983217
## [2,] 0.005323158 0.9946768
## [3,] 0.009991521 0.9900085
## [4,] 0.014905792 0.9850942
## [5,] 0.019713371 0.9802866
## [6,] 0.024294436 0.9757056
## [7,] 0.028619395 0.9713806
## [8,] 0.032689816 0.9673102
## [9,] 0.036517286 0.9634827
## [10,] 0.040116077 0.9598839
## [11,] 0.043500641 0.9564994
## [12,] 0.046684789 0.9533152
## [13,] 0.049681454 0.9503185
## [14,] 0.052502668 0.9474973
## [15,] 0.055159599 0.9448404
## [16,] 0.057662617 0.9423374
## [17,] 0.060021352 0.9399786
## [18,] 0.062244756 0.9377552
## [19,] 0.064341159 0.9356588
## [20,] 0.066318317 0.9336817
## [21,] 0.068183461 0.9318165
## [22,] 0.069943335 0.9300567
## [23,] 0.071604234 0.9283958
## [24,] 0.073172041 0.9268280
## [25,] 0.074652255 0.9253477
## [26,] 0.076050021 0.9239500
## [27,] 0.077370154 0.9226298
## [28,] 0.078617166 0.9213828
## [29,] 0.079795286 0.9202047
## [30,] 0.080908479 0.9190915
## [31,] 0.081960464 0.9180395
## [32,] 0.082954734 0.9170453
## [33,] 0.083894569 0.9161054
## [34,] 0.084783050 0.9152170
## [35,] 0.085623074 0.9143769
## [36,] 0.086417364 0.9135826
## [37,] 0.087168483 0.9128315
## [38,] 0.087878842 0.9121212
## [39,] 0.088550711 0.9114493
## [40,] 0.089186227 0.9108138
## [41,] 0.089787403 0.9102126
## [42,] 0.090356136 0.9096439
## [43,] 0.090894215 0.9091058
## [44,] 0.091403324 0.9085967
## [45,] 0.091885053 0.9081149
## [46,] 0.092340902 0.9076591
## [47,] 0.092772284 0.9072277
## [48,] 0.093180534 0.9068195
## [49,] 0.093566912 0.9064331
## [50,] 0.093932606 0.9060674
plot(FEVD1)set.seed(34)
# nnetar() requires a numeric vector or time series object as
# input ?nnetar() can be seen for more info on the function
# nnetar() by default fits multiple neural net models and
# gives averaged results xreg option allows for only numeric
# vectors in nnetar() function
fit_windspeed = nnetar(xts_train_windspeed)fit_windspeed## Series: xts_train_windspeed
## Model: NNAR(20,10)
## Call: nnetar(y = xts_train_windspeed)
##
## Average of 20 networks, each of which is
## a 20-10-1 network with 221 weights
## options were - linear output units
##
## sigma^2 estimated as 4.732
nnetforecast <- forecast(fit_windspeed, h = 114, PI = T) #Prediction intervals do not come by default in neural net forecasts, in constrast to ARIMA or exponential smoothing model
autoplot(nnetforecast) + theme(plot.title = element_text(hjust = 0.5))fit_meantemp = nnetar(xts_train_meantemp)fit_meantemp## Series: xts_train_meantemp
## Model: NNAR(11,6)
## Call: nnetar(y = xts_train_meantemp)
##
## Average of 20 networks, each of which is
## a 11-6-1 network with 79 weights
## options were - linear output units
##
## sigma^2 estimated as 2.118
nnetforecast <- forecast(fit_meantemp, h = 114, PI = T) #Prediction intervals do not come by default in neural net forecasts, in constrast to ARIMA or exponential smoothing model
autoplot(nnetforecast) + theme(plot.title = element_text(hjust = 0.5))